Real-time Online Video Detection with Temporal Smoothing Transformers
Streaming video recognition reasons about objects and their actions in every
frame of a video. A good streaming recognition model captures both long-term
dynamics and short-term changes of video. Unfortunately, in most existing
methods, the computational complexity grows linearly or quadratically with the
length of the considered dynamics. This issue is particularly pronounced in
transformer-based architectures. To address it, we reformulate the
cross-attention in a video transformer through the lens of kernels and apply
one of two temporal smoothing kernels: a box kernel or a Laplace kernel. The
resulting streaming attention reuses much of the computation from frame to
frame, and only requires a constant time update each frame. Based on this idea,
we build TeSTra, a Temporal Smoothing Transformer, that takes in arbitrarily
long inputs with constant caching and computing overhead. Specifically, it runs
faster than equivalent sliding-window-based transformers with 2,048
frames in a streaming setting. Furthermore, thanks to the increased temporal
span, TeSTra achieves state-of-the-art results on THUMOS'14 and
EPIC-Kitchens-100, two standard online action detection and action
anticipation datasets. A real-time version of TeSTra outperforms all but one
prior approach on the THUMOS'14 dataset.
Comment: ECCV 2022; code available at
https://github.com/zhaoyue-zephyrus/TeSTr
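The constant-time update can be illustrated with the Laplace (exponential) kernel: past frames are down-weighted by a constant factor per step, so the attention numerator and denominator can be carried forward recurrently instead of re-attending over the full history. The sketch below is a minimal illustration of this idea using a generic positive feature map; it is an assumption-laden toy, not TeSTra's actual architecture.

```python
import numpy as np

def phi(x):
    # Positive feature map standing in for the softmax kernel (assumption).
    return np.exp(x - x.max())

def stream_attention(queries, keys, values, decay=0.9):
    """Streaming attention with a Laplace (exponential) temporal kernel.

    Each frame shrinks the weight of all past frames by `decay`, so the
    numerator (a d_k x d_v matrix) and denominator (a d_k vector) are
    updated in O(1) per frame -- no re-attention over the whole history.
    """
    d_k, d_v = keys.shape[1], values.shape[1]
    num = np.zeros((d_k, d_v))               # running numerator
    den = np.zeros(d_k)                      # running denominator
    outputs = []
    for q, k, v in zip(queries, keys, values):
        f = phi(k)
        num = decay * num + np.outer(f, v)   # constant-time state update
        den = decay * den + f
        outputs.append(phi(q) @ num / (phi(q) @ den + 1e-8))
    return np.array(outputs)
```

Output at step t equals the explicit sum over all past frames with weights decay^(t-i), which is exactly what makes the sliding-window recomputation unnecessary.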
Training a Large Video Model on a Single Machine in a Day
Videos are big, complex to pre-process, and slow to train on.
State-of-the-art large-scale video models are trained on clusters of 32 or more
GPUs for several days. As a consequence, academia largely ceded the training of
large video models to industry. In this paper, we show how to still train a
state-of-the-art video model on a single machine with eight consumer-grade GPUs
in a day. We identify three bottlenecks (IO, CPU, and GPU computation) and
optimize each. The result is a highly efficient video training pipeline. For
comparable architectures, our pipeline achieves higher accuracies with a
fraction of the computation of prior work. Code is available at
https://github.com/zhaoyue-zephyrus/AVION.
Comment: Tech report.
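The key to hiding the IO and CPU bottlenecks behind GPU computation is pipelining: while the training step consumes clip t, background workers are already reading and decoding clip t+1. The sketch below is a generic prefetching pattern under that assumption, not the AVION implementation.

```python
import queue
import threading

_SENTINEL = object()

def prefetch(iterable, buffer_size=8):
    """Produce items from `iterable` in a background thread.

    A bounded queue decouples the producer (e.g. video read + decode on
    CPU) from the consumer (e.g. a GPU training step), so IO/CPU latency
    is overlapped with compute instead of stalling it.
    """
    q = queue.Queue(maxsize=buffer_size)

    def producer():
        for item in iterable:
            q.put(item)          # blocks when the buffer is full
        q.put(_SENTINEL)         # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is _SENTINEL:
            return
        yield item
```

A training loop would simply iterate `for batch in prefetch(decoded_clips): ...`; the bounded buffer caps memory while keeping the decoder one step ahead of the GPU.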